We can also pose the pathway query at population level
What is the share of cases for which \(X\) has a positive effect on \(Y\) through \(M\)?
A question about joint \(\lambda^M\) and \(\lambda^Y\) distributions
2 Updating Models
Intuition
2.1 How will we answer questions about populations?
We need to learn about those \(\lambda\)’s
About the proportions of the population with different kinds of causal effects
We will have a prior belief about those proportions
When we see data on lots of cases, we will update those beliefs about proportions
From a prior distribution over \(\lambda\) to a posterior distribution over \(\lambda\)
2.2 How do we “update” our models?
We’ve talked about process tracing a single case to answer a case-level query
Here the model is fixed
We use the model + case data to answer questions about the case
We can also use data to “update” our models
Use data on many cases to learn about causal effects in the population
Allows mixing methods: using data on lots of cases, we can learn about the probative value of process-tracing evidence
The core logic: we learn by updating population-level causal beliefs toward beliefs more consistent with the data
2.3 Start with a DAG
2.4 Large-\(N\) estimation of \(ATE\): what happens to beliefs over parameters
Say we only collect data on \(I\), \(M\), and \(D\) for a large number of cases
We update on \(\lambda^I\), \(\lambda^M\), and \(\lambda^D\) to place more weight on values that are likely to give rise to the pattern of data we see
We will come to put more weight on a joint distribution of \(\lambda^M\) and \(\lambda^D\) in line with the data and less posterior weight on all other combinations
Question: What would you infer if you saw a very high correlation between \(I\) and \(D\) and a low correlation between \(I\) and \(M\) or a low correlation between \(M\) and \(D\)?
2.5 General procedure
Key insight:
If we suppose a given set of parameter values, we can figure out the likelihood of the data given those values.
We can do this for all possible parameter values and see which ones are more in line with the data.
Consider this joint distribution with binary \(X\) and binary \(Y\) from here
        Y = 0                                    Y = 1
X = 0   \(\lambda_{01}/2 + \lambda_{00}/2\)      \(\lambda_{10}/2 + \lambda_{11}/2\)
X = 1   \(\lambda_{10}/2 + \lambda_{00}/2\)      \(\lambda_{01}/2 + \lambda_{11}/2\)
reminder: \(\lambda_{10}\) is share with negative effects, \(\lambda_{01}\) is share with positive effects…
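A minimal Python sketch of how the table's cells follow from the shares, assuming \(X\) is assigned at random with probability 1/2 (the helper function name is illustrative):

```python
# Cell probabilities for the 2x2 (X, Y) table implied by the type shares,
# with X randomized at Pr(X = 1) = 1/2
def cell_probs(l10, l01, l00, l11):
    return {
        ("X=0", "Y=0"): (l01 + l00) / 2,  # positive-effect and never-1 types
        ("X=0", "Y=1"): (l10 + l11) / 2,  # negative-effect and always-1 types
        ("X=1", "Y=0"): (l10 + l00) / 2,  # negative-effect and never-1 types
        ("X=1", "Y=1"): (l01 + l11) / 2,  # positive-effect and always-1 types
    }

p = cell_probs(l10=0.1, l01=0.5, l00=0.2, l11=0.2)
assert abs(sum(p.values()) - 1) < 1e-12  # cells are a proper distribution
```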
2.6.2 Causal inference on a grid: strategy
Say we now had (finite) data filling out this table. What posteriors should we form over \(\lambda_{10},\lambda_{01},\lambda_{00},\lambda_{11}\)?
        Y = 0        Y = 1
X = 0   \(n_{00}\)   \(n_{01}\)
X = 1   \(n_{10}\)   \(n_{11}\)
Let’s start with a flat prior over the shares and then update over possible shares based on the data.
This time we start with draws of possible shares and then compute posterior weights for each drawn share.
2.6.3 Causal inference on a grid: likelihood
\[
\Pr(n_{00}, n_{01}, n_{10}, n_{11} \mid \lambda_{10},\lambda_{01},\lambda_{00},\lambda_{11}) =
f_{\text{multinomial}}\left( n_{00}, n_{01}, n_{10}, n_{11} \mid \textstyle\sum n, w \right)
\] where:
\[w = \left(\frac12(\lambda_{01} + \lambda_{00}), \frac12(\lambda_{10}+\lambda_{11}), \frac12(\lambda_{10}+\lambda_{00}), \frac12(\lambda_{01}+\lambda_{11})\right)\] The weights are just the probability of each data type, given \(\lambda\).
# add likelihood and calculate posterior
x <- x |>
  rowwise() |>  # ensures row-wise operations
  mutate(
    likelihood = dmultinom(
      c(400, 100, 100, 400),
      prob = c(b + c, a + c, a + d, b + d) / 2
    )
  ) |>
  ungroup() |>
  mutate(posterior = likelihood / sum(likelihood))
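A Python sketch of the same computation, assuming the labeling used in the code (a = \(\lambda_{10}\), b = \(\lambda_{01}\), c = \(\lambda_{00}\), d = \(\lambda_{11}\)) and substituting fresh Dirichlet draws for the grid:

```python
import numpy as np

rng = np.random.default_rng(0)
counts = np.array([400, 100, 100, 400])  # observed cell counts, as above

# Draws of possible shares from a flat Dirichlet prior,
# with a = lambda_10, b = lambda_01, c = lambda_00, d = lambda_11
shares = rng.dirichlet(np.ones(4), size=10_000)
a, b, c, d = shares.T

# Cell probabilities implied by each draw (X randomized, Pr(X = 1) = 1/2),
# in the same order as the counts vector
w = np.column_stack([b + c, a + c, a + d, b + d]) / 2

# Multinomial log-likelihood (up to a constant) and normalized posterior weights
loglik = counts @ np.log(w).T
posterior = np.exp(loglik - loglik.max())
posterior /= posterior.sum()

# Posterior mean ATE: share with positive effects minus share with negative effects
ate = posterior @ b - posterior @ a
```

With these counts, the posterior-weighted ATE should land near the 0.59 reported below, since b − a is roughly constant along the ridge of high-likelihood draws.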
2.6.6 Causal inference on a grid: execution
    a      b      c      d   likelihood    posterior
 0.30   0.17   0.53   0.00    2.10e-212    1.72e-209
 0.50   0.21   0.29   0.00    1.26e-221    1.03e-218
 0.11   0.38   0.13   0.39     7.80e-46     6.38e-43
 0.63   0.02   0.14   0.20     0.00e+00     0.00e+00
 0.48   0.09   0.18   0.24    1.97e-231    1.61e-228
 0.27   0.07   0.52   0.15    3.99e-194    3.26e-191
2.6.7 Causal inference on a grid: inferences
We calculate queries like this:
x |>
  summarize(
    a = weighted.mean(a, posterior),
    b = weighted.mean(b, posterior),
    ATE = b - a
  ) |>
  kable(digits = 2)
    a      b    ATE
 0.10   0.69   0.59
2.6.8 Causal inference on a grid: inferences
x |>
  ggplot(aes(b, a, size = posterior)) +
  geom_point(alpha = .5)
Spot the ridge
2.7 In sum: learning from data
For any data pattern, we gain confidence in parameter values more consistent with the data
For single-case inference, we must bring background beliefs about population-level causal effects
For multiple cases, we can learn about effects from the data
Large-\(N\) data can thus provide probative value for small-\(N\) process-tracing
All inference is conditional on the model
3 Mixed methods
Combining wide and deep data
3.1 A DAG
We’ll want to learn about the \(\theta\)’s and the \(\lambda\)’s
We need to observe nodes to learn about other nodes
We can potentially observe 3 nodes here: \(X, M\), and \(Y\)
3.2 A typical “quantitative” data structure
Data on exogenous variables and a key outcome for many cases
E.g., data on inequality (\(I\)) and democracy (\(D\)) for many cases
3.3 A typical “qualitative” data structure
Data on exogenous variables and a key outcome plus elements of process for a small number of cases
Finite resources mean tradeoffs between extensive and intensive data collection
E.g., data on inequality (\(I\)), mass mobilization (\(M\)), and democracy (\(D\)) for a small number of cases
3.4 Mixing qualitative and quantitative
What if we combine extensive data on many cases with intensive data on a few cases?
A non-rectangular data structure
3.5 Non-rectangular data
A data structure that neither standard quantitative nor standard qualitative approaches can handle in a systematic way
Not a problem for the Integrated Inferences approach
We simply ask:
Which causal effects in the population are most and least consistent with the data pattern we observe?
That is, what distribution of causal effects in the population, for each node, is most consistent with this data pattern?
CausalQueries uses information wherever it finds it
3.6 Mixing in practice
For Bayesian approaches this mixing is not hard.
Critically, though, we maintain the assumption that cases for in-depth analysis are chosen at random; otherwise we have to account for selection processes.
What is the probability of seeing these two cases:
Say we just observe a positive Inequality-Democratization correlation
Could be because Inequality causes Democratization
Could be because of confounding
3.9.2 How qual can inform quant: confounding
Remember
Observing \(M\) helps
Process data helps address the deep problem of confounding
Key point: we don’t need \(M\) for all cases
Can learn from \(I\) and \(D\) for lots of cases and \(M\) for a subset
3.9.3 How qual can inform quant: observable confounder
Another example: \(M\) as the confounder
3.9.4 How qual can inform quant: observable confounder
How much can we learn from \(M\) data for some cases?
3.9.5 How quant can inform qual: getting probative value of a clue from the data
Suppose we go to the field and we learn that mass mobilization DID occur in Malawi
So \(M=1\)
What can we conclude?
NOTHING YET!
3.9.6 How quant can inform qual: getting probative value of a clue from the data
The pure process-tracing solution: assign our beliefs about causal effects in the population
E.g., beliefs that linked positive effects are more likely than linked negative effects
Meaning that \(M=1\) in an \(I=1, D=1\) case speaks in favor of \(I=1\) causing \(D=1\)
The mixed-methods solution: learn about population-level effects from large-\(N\) data
3.9.7 How quant can inform qual: getting probative value of a clue from the data
Suppose we have data on \(I\), \(D\), and \(M\) for a large number of cases
Suppose we observe a strong positive correlation across all 3 variables
What have we learned, under this model?
Positive \(I \rightarrow M\) effects more likely than negative
Positive \(M \rightarrow D\) effects more likely than negative
So linked positive effects more common than linked negative effects
Meaning that \(M=1\) in an \(I=1, D=1\) case speaks in favor of \(I=1\) causing \(D=1\)
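The Bayesian arithmetic here can be sketched under illustrative, assumed population shares for the chain \(I \rightarrow M \rightarrow D\); the shares, and the independence of the two nodes’ types, are assumptions for illustration only:

```python
# Illustrative shares over nodal types for M (given I) and D (given M);
# "01" = positive effect, "10" = negative, "00"/"11" = never/always 1.
# These numbers are assumed, standing in for shares learned from large-N data.
lam_M = {"00": 0.2, "10": 0.1, "01": 0.5, "11": 0.2}
lam_D = {"00": 0.2, "10": 0.1, "01": 0.5, "11": 0.2}

# Case with I = 1, M = 1, D = 1 in the chain model:
# M-types consistent with M = 1 given I = 1 are "01" and "11";
# D-types consistent with D = 1 given M = 1 are "01" and "11".
p_M = lam_M["01"] + lam_M["11"]
p_D = lam_D["01"] + lam_D["11"]

# I = 1 caused D = 1 through M only for linked positive effects
# ("01" at both nodes), assuming the two nodes' types are independent
p_cause = (lam_M["01"] * lam_D["01"]) / (p_M * p_D)
print(round(p_cause, 2))  # 0.25 / 0.49, about 0.51
```

If negative effects were more common than positive ones in the assumed shares, the same arithmetic would push the probability down; this is how learned population-level beliefs give the clue its probative value.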
But now we’ve drawn our population-level beliefs from the data
Now, we can go and process-trace
Did high inequality cause democratization in Malawi?
Observe \(M\)
With conclusions grounded in case-level AND population-level evidence
3.9.8 Application from the book: rule-of-law institutions and long-term growth
We start with flat priors over causal types
Gather data on all nodes for many cases
3.9.9 Rule-of-law and growth: process-tracing probative value from large-\(N\) data
3.9.10 Rule-of-law and growth: learning about confounding
We allowed for confounding between rule of law and growth
Mortality’s effects on institutions may be correlated with institutions’ effects on growth
We learn about that confounding from the data
Rule of law more often has a positive effect on growth where mortality has a negative effect on institutions
Consistent with selection effects:
When mortality is low, settlers make institutional choices in anticipation of their growth effects
3.9.11 Rule-of-law and growth: learning about confounding
What this looks like in our posteriors over nodal types:
A type where RoL has a positive effect on Growth is more common when Mortality has a negative effect on RoL.
4 Mixed methods in CausalQueries
4.1 Big picture
CausalQueries brings these elements together by allowing users to:
Make model: specify a DAG; CausalQueries figures out all principal strata and places a prior on these
Update model: provide data; CausalQueries writes a stan model and updates on all parameters
Query model: CausalQueries figures out which parameters correspond to a given causal query
4.2 Illustration \(X \rightarrow Y\) model
Consider this problem:
Y = 0
Y = 1
X = 0
\(n_{00}\)
\(n_{01}\)
X = 1
\(n_{10}\)
\(n_{11}\)
where \(X\) is randomized, both \(X\), \(Y\) binary
4.3 Model, update, query
data = fabricate(
  N = 1000,
  X = rbinom(N, 1, prob = .5),
  Y = rbinom(N, 1, prob = .2 + .4 * X)
)

model <- make_model("X -> Y") |>
  update_model(data)
4.4 Model, update, query
model |> inspect("posterior_distribution")
posterior_distribution
Summary statistics of model parameters posterior distributions:
Distributions matrix dimensions are
4000 rows (draws) by 6 cols (parameters)
mean sd
X.0 0.48 0.02
X.1 0.52 0.02
Y.00 0.28 0.07
Y.10 0.12 0.07
Y.01 0.50 0.07
Y.11 0.11 0.07
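A quick sanity check on these numbers (simple arithmetic, not CausalQueries output): the implied ATE is the posterior mean share of positive effects minus the share of negative effects.

```python
# ATE implied by the posterior means reported above:
# share with positive effects (Y.01) minus share with negative effects (Y.10)
ate = round(0.50 - 0.12, 2)
print(ate)  # 0.38, close to the 0.4 effect built into the simulated data
```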
4.5 Model, update, query
model |>
  grab("posterior_distribution") |>
  ggplot(aes(Y.01, Y.10)) +
  geom_point(alpha = .2)
Posterior draws
4.6 Model, update, query
model |>
  query_model(
    query = c(
      ATE = "Y[X=1] - Y[X=0]",
      POS = "Y[X=1] > Y[X=0]",
      SOME = "Y[X=1] != Y[X=0]"
    ),
    using = c("priors", "posteriors")
  ) |>
  plot()
4.7 Generalization: Procedure
The CausalQueries approach generalizes to settings in which nodes are categorical:
Identify all principal strata: that is, the universe of possible response types or “causal types”: \(\theta\)
Define as parameters of interest the probability of each of these response types: \(\lambda\)
Place a prior over \(\lambda\): e.g. Dirichlet
Figure out \(\Pr(\text{Data} | \lambda)\)
Use stan to figure out \(\Pr(\lambda | \text{Data})\)
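The first step, enumerating principal strata, can be sketched for a binary node with \(k\) binary parents (a hypothetical helper, not the CausalQueries implementation): the node needs one output per parent configuration, giving \(2^{2^k}\) response types.

```python
from itertools import product

def response_types(k):
    """All response types for a binary node with k binary parents:
    one 0/1 output for each of the 2 ** k parent configurations."""
    configs = list(product([0, 1], repeat=k))
    return list(product([0, 1], repeat=len(configs)))

# k = 1 recovers the four types lambda_00, lambda_10, lambda_01, lambda_11
assert len(response_types(1)) == 4
assert len(response_types(2)) == 16  # 2 ** (2 ** 2)
```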
4.8 Generalization: Procedure
Also possible when there is unobserved confounding
…where dotted lines means that the response types for two nodes are not independent
5 Extra slides
5.1 Illustration: “Lipids” data
Example of an IV model. What are the principal strata (response types)? What relations of conditional independence are implied by the model?
data("lipids_data")

lipids_data |> kable()
 event    strategy   count
 Z0X0Y0   ZXY          158
 Z1X0Y0   ZXY           52
 Z0X1Y0   ZXY            0
 Z1X1Y0   ZXY           23
 Z0X0Y1   ZXY           14
 Z1X0Y1   ZXY           12
 Z0X1Y1   ZXY            0
 Z1X1Y1   ZXY           78
Note that in compact form we simply record the number of units (“count”) that display each possible pattern of outcomes on the three variables (“event”).[^1]
5.2 Model
model <- make_model("Z -> X -> Y; X <-> Y")

model |> plot()
5.3 Updating and querying
Queries can be conditioned on observable or counterfactual quantities